Health check feature for virtual router #3575
|
@blueorangutan package |
|
@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔centos6 ✖centos7 ✔debian. JID-274 |
|
@anuragaw I definitely like this feature. How does it work? |
|
@blueorangutan package |
|
@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖centos6 ✔centos7 ✔debian. JID-286 |
|
@blueorangutan test |
|
@anuragaw a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
Trillian test result (tid-381)
|
b081006 to 9acaadd
|
Rebased and cleaned up the UI, refactored into separate scripts, and tested thoroughly to fix some cases around internal LB VM related scripts. |
|
@blueorangutan package |
|
@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖centos6 ✔centos7 ✔debian. JID-347 |
|
@blueorangutan test |
|
ping @rhtyd, @nvazquez, @DaanHoogland, @Spaceman1984, @shwstppr for review. |
|
@blueorangutan package |
|
@blueorangutan test |
|
@blueorangutan package |
|
@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✖centos6 ✔centos7 ✔debian. JID-352 |
|
@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
Cc @Doni7722 . This looks pretty cool! The included checks may actually be enough for our use case - but inserting custom checks seems a bit inconvenient. If I understand correctly, we'd have to build our own VR image? Also, I don't think putting the checker script and temporary configs into /root is a good idea - scripts should live in a more standardised location like /usr/lib/cloudstack, while dynamic configs should go to /var/lib/cloudstack or similar. |
|
@onitake you can just build a custom systemvm.iso, i.e. extract the ISO, extract the tgz, add scripts to the "root/health-checks/" folder, package it back into a tgz, and repack the ISO. Or simply have automation that connects via SSH to all VRs and drops the file in the folder if it's not already there. |
|
@andrijapanicsb Well, building a custom ISO is not exactly convenient either. But in most cases, the added preparation step and a build job on a CI system would be worth it. If you have to change checks often, this will be very inconvenient, however. Pushing changes to VRs directly is something we want to avoid, as this can lead to strange problems if something is missed. We did something like that in the past. But, paired with updating the image, it might be a feasible option. On the other hand, adding an API to inject custom scripts into VRs is (obviously) a big security risk. |
|
Trillian test result (tid-845)
|
|
looks ready for merge, rekicking travis |
We want to support more exhaustive health checks for VRs. This feature helps admins configure health checks and also expands their scope. There are two categories of health checks: basic and advanced (more expensive, so they should be run less frequently). The following checks have been added with a separate script -
The following global configs were added for configuring health checks:
• "router.health.checks.enabled" - If true, router health checks are allowed to be executed and read. If false, all scheduled checks and API calls for on demand checks are disabled. Default is true.
• "router.health.checks.basic.interval" - Interval in minutes at which basic router health checks are performed. If set to 0, no tests are scheduled. Default is 3 mins as per the existing monitor services.
• "router.health.checks.advanced.interval" - Interval in minutes at which advanced router health checks are performed. If set to 0, no tests are scheduled. Default value is 10 minutes.
• "router.health.checks.config.refresh.interval" - Interval in minutes at which the router health checks config (scheduling intervals, excluded checks, etc.) is updated on virtual routers by the management server. This value should be sufficiently higher (e.g. 2x) than router.health.checks.basic.interval and router.health.checks.advanced.interval so that there is time for new results to be generated for the passed data. Default is 10 mins.
• "router.health.checks.results.fetch.interval" - Interval in minutes at which router health check results are fetched by the management server. On each result fetch, the management server evaluates the need to recreate the VR as per the configuration of router.health.checks.failures.to.recreate.vr. This value should be sufficiently higher (e.g. 2x) than router.health.checks.basic.interval and router.health.checks.advanced.interval so that there is time between new results generation and fetch.
• "router.health.checks.failures.to.recreate.vr" - Health check failures defined by this config are the ones that should cause router recreation. If empty, recreation is not attempted for any health check failure. Possible values are comma separated script names from systemvm’s /root/health_scripts/ (namely cpu_usage_check.py, dhcp_check.py, disk_space_check.py, dns_check.py, gateways_check.py, haproxy_check.py, iptables_check.py, memory_usage_check.py, router_version_check.py), connectivity.test, or services (namely loadbalancing.service, webserver.service, dhcp.service).
• "router.health.checks.to.exclude" - Health checks that should be excluded when executing scheduled checks on the router. This can be a comma separated list of script names placed in the '/root/health_checks/' folder. Currently the following scripts are placed in the default systemvm template: cpu_usage_check.py, disk_space_check.py, gateways_check.py, iptables_check.py, router_version_check.py, dhcp_check.py, dns_check.py, haproxy_check.py, memory_usage_check.py.
• "router.health.checks.free.disk.space.threshold" - Free disk space threshold (in MB) on VR below which the check is considered a failure. Default is 100MB.
• "router.health.checks.max.cpu.usage.threshold" - Max CPU usage threshold (in %) above which the check is considered a failure.
• "router.health.checks.max.memory.usage.threshold" - Max memory usage threshold (in %) above which the check is considered a failure.
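To make the threshold and recreation semantics above concrete, here is a minimal Python sketch. It is not the code from this PR: the function names and the CPU/memory limits used in the example call are hypothetical and chosen only for illustration. The rules it encodes are the ones stated in the config descriptions: free disk space below the threshold fails, CPU or memory usage above the threshold fails, and an empty router.health.checks.failures.to.recreate.vr never triggers a recreate.

```python
def failed_threshold_checks(free_disk_mb, cpu_pct, mem_pct,
                            min_free_disk_mb, max_cpu_pct, max_mem_pct):
    """Return the names of the threshold based checks that fail."""
    failed = []
    if free_disk_mb < min_free_disk_mb:   # free space below threshold fails
        failed.append("disk_space_check.py")
    if cpu_pct > max_cpu_pct:             # usage above threshold fails
        failed.append("cpu_usage_check.py")
    if mem_pct > max_mem_pct:
        failed.append("memory_usage_check.py")
    return failed


def parse_check_list(value):
    """Split a comma separated config value into a list of check names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def should_recreate(failed_checks, failures_to_recreate_vr):
    """True if any failed check is listed in the recreate config value.

    An empty config value means a recreate is never attempted.
    """
    wanted = set(parse_check_list(failures_to_recreate_vr))
    return any(name in wanted for name in failed_checks)


# Illustrative values only: 100 MB is the documented disk default,
# the 85% and 90% limits are made up for this example.
failed = failed_threshold_checks(80, 35.0, 95.0,
                                 min_free_disk_mb=100,
                                 max_cpu_pct=85, max_mem_pct=90)
print(failed)
print(should_recreate(failed, "disk_space_check.py, dns_check.py"))
```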
API Changes:
Additionally, the feature looks for any executable script in the /root/health_scripts/ directory and adds its result to the JSON output of the overall health checks. This allows custom checks to be added, and custom systemvm templates can also support health checks.
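As a rough illustration of what such a custom check could look like, here is a sketch in Python. The contract is not spelled out in this description, so everything about it here is an assumption: that the script receives the check type ("basic" or "advanced") as its first argument, prints a short message, and signals pass or fail through its exit code (0 for pass, non-zero for fail). The load limit and function names are hypothetical.

```python
#!/usr/bin/env python3
import os
import sys

MAX_LOAD_PER_CPU = 2.0  # hypothetical limit, not a CloudStack setting


def check_load(load_1min, cpu_count, limit=MAX_LOAD_PER_CPU):
    """Return (ok, message) for the 1-minute load average per CPU."""
    per_cpu = load_1min / max(cpu_count, 1)
    ok = per_cpu <= limit
    message = "1-minute load per CPU is %.2f (limit %.2f)" % (per_cpu, limit)
    return ok, message


def main(argv):
    # Assumption: the first argument names the check category.
    check_type = argv[1] if len(argv) > 1 else "basic"
    if check_type != "basic":
        print("Skipped: this check only runs with the basic checks")
        return 0
    ok, message = check_load(os.getloadavg()[0], os.cpu_count() or 1)
    print(message)
    return 0 if ok else 1  # assumed convention: 0 = pass, non-zero = fail


# To use this as an executable script, end the file with:
#     sys.exit(main(sys.argv))
```

Keeping the measurement (`os.getloadavg`) separate from the decision (`check_load`) makes the pass/fail logic easy to test outside a VR.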
The UI shows the router in an alert state if health checks fail.
The health checks can be manually triggered using the new API added in this feature (both the CLI and the UI support this).
Description
Fixes: 3270
How Has This Been Tested?
Integration tests, manually, CMK, UI
API Changes -
New parameters added to list routers-
And added new API - getRouterHealthCheckResults-
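For readers who want to call the new API outside of CMK or the UI, the following Python sketch builds a signed request URL using CloudStack's usual HMAC-SHA1 request signing. It only constructs the URL and does not reproduce the API's parameters or response from this PR: the `routerid` parameter name, the endpoint, and the key values are assumptions for illustration, and `build_signed_url` is a hypothetical helper.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def build_signed_url(endpoint, params, api_key, secret_key):
    """Return a signed CloudStack API URL for the given parameters."""
    all_params = dict(params, apikey=api_key, response="json")
    # Sort by parameter name and URL-encode every value.
    pairs = sorted(
        ((key, quote(str(value), safe="")) for key, value in all_params.items()),
        key=lambda pair: pair[0].lower(),
    )
    query = "&".join("%s=%s" % pair for pair in pairs)
    # The signature is computed over the lower-cased query string.
    digest = hmac.new(secret_key.encode("utf-8"),
                      query.lower().encode("utf-8"),
                      hashlib.sha1).digest()
    signature = quote(base64.b64encode(digest).decode("ascii"), safe="")
    return "%s?%s&signature=%s" % (endpoint, query, signature)


# Placeholder endpoint, keys, and router ID; "routerid" is an assumed name.
url = build_signed_url(
    "http://localhost:8080/client/api",
    {"command": "getRouterHealthCheckResults", "routerid": "ROUTER-UUID"},
    api_key="API-KEY",
    secret_key="SECRET-KEY",
)
print(url)
```

The resulting URL can be fetched with any HTTP client; check the API documentation of your CloudStack version for the exact parameters accepted by getRouterHealthCheckResults.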